Let’s assume you are an HR manager (with a soft spot for statistics) and you are trying to find out how to best analyse data from a survey about employee satisfaction.
The consultant you hired came back with a bunch of colorful charts and employee satisfaction numbers that were precise to 3 digits, but you couldn’t stop thinking of you professors rambling that firms would learn more about their employees if they used their data better and did some multilevel analysis.
You weren’t entirely convinced, because you found his argumentation to abstract, but you also were a good hacker and thought now is the time to just simulate some data and see if one is really doing better with a multilevel analysis.
Here are the things you know about your firm:
Here is this expressed in some variables.
n_departments = 15
set.seed(123)
n_employees =
rep(c(5,22,98), each = 5) +
rbinom(n_departments,5,.5)
mean_firm_satisfaction = 3
sd_dep_satisfaction = .25
Here is a simple model of the data generating process, and its translation in simulation code:
| in words | model | simulation code |
|---|---|---|
|
The average satisfaction in departments has a normal distribution with a
mean of 3 and a sd of 0.25 |
\(\mu_d = normal(3, .25)\) |
|
|
The satisfaction of individual employees varies with sd = 1 around the
average satisfaction in departments. |
\(S_i = normal(\mu_d, 1)\) |
|
Now we can simulate satisfaction of employees in the firm:
set.seed(1)
# mean satisfaction in departments
department_mus =
rnorm(n_departments,
mean_firm_satisfaction,
sd_dep_satisfaction)
# employee level satisfaction
firm_satisfaction = c()
for (d in 1:n_departments) {
# employee satisfaction varies around department mean.
# Due to measurement error, the average of employee satisfaction
# deviates from the true department mean.
employee_satisfaction =
rnorm(n_employees[d],
department_mus[d],
1)
firm_satisfaction = rbind(
firm_satisfaction,
data.frame(
department = as.integer(d),
satisfaction = employee_satisfaction
)
)
}
Here is a brief summary of the data:
dep.satisf =
describeBy(firm_satisfaction$satisfaction,
group = firm_satisfaction$department,
mat = TRUE, skew = FALSE) %>%
.[,-c(1,3)]
dep.satisf %>%
kable(digits = 1, row.names = FALSE) %>%
kable_styling(full_width = FALSE)
| group1 | n | mean | sd | min | max | range | se |
|---|---|---|---|---|---|---|---|
| 1 | 7 | 3.4 | 0.4 | 2.8 | 3.8 | 1.0 | 0.2 |
| 2 | 8 | 2.7 | 0.9 | 1.1 | 3.7 | 2.6 | 0.3 |
| 3 | 7 | 2.7 | 0.8 | 1.4 | 4.1 | 2.7 | 0.3 |
| 4 | 9 | 3.5 | 0.7 | 2.7 | 4.5 | 1.8 | 0.2 |
| 5 | 9 | 3.3 | 0.8 | 2.0 | 4.5 | 2.6 | 0.3 |
| 6 | 23 | 2.9 | 1.1 | 1.0 | 5.2 | 4.2 | 0.2 |
| 7 | 25 | 3.1 | 0.9 | 1.6 | 4.7 | 3.1 | 0.2 |
| 8 | 26 | 3.3 | 0.8 | 2.5 | 5.0 | 2.4 | 0.1 |
| 9 | 25 | 2.7 | 0.9 | 1.2 | 5.2 | 4.0 | 0.2 |
| 10 | 24 | 3.2 | 1.1 | 1.4 | 5.2 | 3.8 | 0.2 |
| 11 | 102 | 3.4 | 1.0 | 0.5 | 6.0 | 5.5 | 0.1 |
| 12 | 100 | 3.2 | 1.0 | 0.5 | 5.1 | 4.6 | 0.1 |
| 13 | 101 | 2.8 | 1.1 | -0.2 | 5.0 | 5.2 | 0.1 |
| 14 | 101 | 2.4 | 1.1 | -0.1 | 6.3 | 6.3 | 0.1 |
| 15 | 99 | 3.1 | 1.1 | 0.3 | 6.0 | 5.6 | 0.1 |
A standard way to visualize these results is to use bar charts of group means, maybe together with error bars. The figure also shows the true department level satisfaction, which we usually cannot observe, as green stars.
plot_data = function(return.x = FALSE) {
set_par()
se = dep.satisf$se
x = barplot(dep.satisf$mean,
names.arg = dep.satisf$group1,
xlab = "department",
ylab = "satisfaction",
ylim = c(0, max(dep.satisf$mean+1.96*se)),
cex.names = .85)
arrows(x0 = x,
y0 = dep.satisf$mean - 1.96*se,
y1 = dep.satisf$mean + 1.96*se,
length = 0)
abline(h = mean(firm_satisfaction$satisfaction), col = "red", lty = 3)
points(x,department_mus, pch = 8, col = "green3")
if (return.x == TRUE)
return(x)
}
plot_data()
In our simulated world, the true and the measured satisfaction
differ due to random measurement error, as indicated by this line in the
simulation code:
employee_satisfaction = rnorm(n_employees[d], department_mus[d], 1)
The difference is larger in smaller departments, where the random error
is less likely to average out.
In the context of multilevel modeling, we can think of three qualitatively different approaches:
A Bayesian model to estimate identical group means (full pooling) looks as follows:
\[ S_i \sim normal(\bar \mu, \sigma) \\ \bar \mu \sim normal(3,3) \\ \sigma \sim exponential(1) \\ \]
where \(\bar \mu\) is the average satisfaction in the departments.
However, to facilitate the comparison to alternative models, we re-write the model as follows:
| Full pooling |
|---|
| \[\begin{align*} S_i \sim normal(\mu_d, \sigma) & \;\;\;\;{\small \textrm{Individuals satisfaction is department satisfaction plus error}} \\ \mu_d \sim normal(\bar \mu,.0001) & \;\;\;\;{\small \textrm{Department satisfaction is equal to average firm satisfaction}} \\ \bar \mu \sim normal(3,3) & \;\;\;\;{\small \textrm{Prior for average department satisfaction}} \\ \sigma \sim exponential(1) & \;\;\;\;{\small \textrm{Prior for error variance}} \\ \end{align*}\] |
\(\mu_d\) are department level satisfactions, which depend on the average department level satisfaction \(\bar \mu\) and the variation between departments. By setting the standard deviation for the variation between departments to 0.0001, we are enforcing that the department level means will be (for all practical purpose) equal.
Here are the corresponding ulam models:
The 2nd model implements a “non-centered parameterization”, which is a good default. When the data is very informative about the variability between groups (many observations per group) centered parameterization can be better. ( details)
Centered parameterization:
mu[d] ~ dnorm(mu_bar, sigma_mu)
Non-centered parameterization:
z[d] ~ dnorm(0,1)
mu <- mu_bar + z*sigma_mu
m.full_pooling = alist(
S ~ dnorm(mu_bar, sigma),
mu_bar ~ dnorm(3,3),
sigma ~ dexp(2)
)
m.full_pooling.b = alist(
S ~ dnorm(mu, sigma),
mu <- mu_bar + z[d]*.0001,
z[d] ~ dnorm(0,1),
mu_bar ~ dnorm(3,3),
sigma ~ dexp(2),
# calculate mu per department in generated quantities
gq> vector[d]:mus<<-mu_bar+z*.0001
)
Before we fit the model, we put the data into a list:
u.data = list(
S = firm_satisfaction$satisfaction,
d = as.integer(firm_satisfaction$department)
)
Now we estimate the models …
n.cores = ifelse(Sys.info()["sysname"] == "Darwin",4,1)
fit.full_pooling = ulam(
m.full_pooling,
data = u.data,
log_lik = TRUE, iter = 1000,
chains = 4, cores = n.cores)
fit.full_pooling.b = ulam(
m.full_pooling.b,
data = u.data,
log_lik = TRUE, iter = 1000,
chains = 4, cores = n.cores)
and check convergence:
precis(fit.full_pooling) %>%
round(2)
## mean sd 5.5% 94.5% n_eff Rhat4
## mu_bar 3.00 0.04 2.93 3.07 1427.63 1
## sigma 1.06 0.03 1.01 1.11 1630.09 1
precis(fit.full_pooling.b,
depth = 2) %>%
round(2)
## mean sd 5.5% 94.5% n_eff Rhat4
## z[1] 0.00 1.03 -1.61 1.59 2488.88 1
## z[2] 0.02 1.02 -1.61 1.66 2180.68 1
## z[3] 0.00 1.00 -1.62 1.57 2597.15 1
## z[4] 0.02 0.96 -1.52 1.53 2655.22 1
## z[5] 0.01 1.01 -1.61 1.66 2455.21 1
## z[6] -0.01 1.03 -1.66 1.58 2822.41 1
## z[7] 0.03 1.02 -1.60 1.69 2736.28 1
## z[8] -0.03 0.98 -1.63 1.60 2877.10 1
## z[9] 0.00 0.99 -1.56 1.59 2290.90 1
## z[10] -0.02 1.02 -1.68 1.60 3289.54 1
## z[11] -0.03 1.02 -1.67 1.55 2419.84 1
## z[12] 0.00 1.00 -1.59 1.63 2179.59 1
## z[13] 0.00 1.05 -1.67 1.69 2670.91 1
## z[14] 0.04 1.04 -1.60 1.73 2405.60 1
## z[15] 0.02 1.00 -1.55 1.55 2823.73 1
## mu_bar 3.00 0.04 2.94 3.07 2556.35 1
## sigma 1.06 0.03 1.01 1.11 2314.66 1
## mus[1] 3.00 0.04 2.94 3.07 2556.73 1
## mus[2] 3.00 0.04 2.94 3.07 2557.32 1
## mus[3] 3.00 0.04 2.94 3.07 2556.79 1
## mus[4] 3.00 0.04 2.94 3.07 2555.67 1
## mus[5] 3.00 0.04 2.94 3.07 2556.50 1
## mus[6] 3.00 0.04 2.94 3.07 2557.25 1
## mus[7] 3.00 0.04 2.94 3.07 2557.49 1
## mus[8] 3.00 0.04 2.94 3.07 2556.55 1
## mus[9] 3.00 0.04 2.94 3.07 2556.24 1
## mus[10] 3.00 0.04 2.94 3.07 2556.64 1
## mus[11] 3.00 0.04 2.94 3.07 2556.59 1
## mus[12] 3.00 0.04 2.94 3.07 2555.64 1
## mus[13] 3.00 0.04 2.94 3.07 2556.11 1
## mus[14] 3.00 0.04 2.94 3.07 2556.21 1
## mus[15] 3.00 0.04 2.94 3.07 2556.29 1
Do the two models really describe the data equally well, and is the number of effective parameters more or less equal?
compare(fit.full_pooling,
fit.full_pooling.b) %>%
round(2)
## WAIC SE dWAIC dSE pWAIC weight
## fit.full_pooling 1969.36 36.14 0.00 NA 1.97 0.5
## fit.full_pooling.b 1969.38 36.23 0.02 0.12 1.98 0.5
Indeed, the difference between the two models is very small. More importantly, we can see that the number of effective parameters of the two versions of the full polling model are 1.97 and 1.98, even though the models have 2 and 17 parameters, respectively. (we expect 2 effective parameters, mean and standard deviation.)
And here is our initial figure with the estimated average employee satisfactions:
x = plot_data(return.x = TRUE)
est.full_pooling = precis(fit.full_pooling.b, pars = "mus", depth = 2)
points(x,est.full_pooling$mean, col = "blue", pch = 16)
As expected we estimated the same mean for everyone
A Bayesian model to estimate independent group means looks as follows:
| Full pooling | No pooling |
|---|---|
| \[ S_i \sim normal(\mu_d, \sigma) \\ \mu_d \sim normal(\bar \mu,.0001) \\ \bar \mu \sim normal(3,3) \\ \sigma \sim exponential(1)\] | \[S_i \sim normal(\mu_d, \sigma) \\ \mu_d \sim normal(\bar \mu,\color{red}{1000}) \\ \bar \mu \sim normal(3,3) \\ \sigma \sim exponential(1)\] |
In this model, the large standard deviation of the distribution of department level satisfaction (\(\small \mu_d \sim normal(\bar \mu,1000)\)) encodes the assumption of independent groups.
Getting from a full pooling to a no polling model by changing a model parameter shows that this are not qualitatively distinct models, but opposite endpoints of a continuous model space.
The corresponding ulam model looks as follows:
m.no_pooling = alist(
S ~ dnorm(mu, sigma),
mu <- mu_bar + z[d]*1000,
z[d] ~ dnorm(0,1),
mu_bar ~ dnorm(3,3),
sigma ~ dexp(2),
gq> vector[d]:mus<<-mu_bar+z*1000
)
Now we estimate the model …
fit.no_pooling = ulam(
m.no_pooling,
data = u.data,
log_lik = TRUE, iter = 1000,
chains = 4, cores = n.cores,
)
and check convergence:
precis(fit.no_pooling, depth = 2) %>%
round(2)
## mean sd 5.5% 94.5% n_eff Rhat4
## z[1] 0.00 0.00 0.00 0.01 305.98 1.01
## z[2] 0.00 0.00 -0.01 0.00 306.50 1.01
## z[3] 0.00 0.00 -0.01 0.00 305.38 1.01
## z[4] 0.00 0.00 0.00 0.01 304.18 1.01
## z[5] 0.00 0.00 0.00 0.01 306.22 1.01
## z[6] 0.00 0.00 -0.01 0.01 301.86 1.01
## z[7] 0.00 0.00 -0.01 0.01 302.76 1.01
## z[8] 0.00 0.00 0.00 0.01 302.40 1.01
## z[9] 0.00 0.00 -0.01 0.00 302.22 1.01
## z[10] 0.00 0.00 0.00 0.01 302.57 1.01
## z[11] 0.00 0.00 0.00 0.01 302.80 1.01
## z[12] 0.00 0.00 -0.01 0.01 301.32 1.01
## z[13] 0.00 0.00 -0.01 0.00 302.02 1.01
## z[14] 0.00 0.00 -0.01 0.00 300.85 1.01
## z[15] 0.00 0.00 -0.01 0.01 301.57 1.01
## mu_bar 3.06 3.18 -2.10 8.21 301.65 1.01
## sigma 1.02 0.03 0.97 1.06 614.45 1.00
## mus[1] 3.41 0.39 2.78 4.02 3936.51 1.00
## mus[2] 2.67 0.36 2.10 3.23 4236.56 1.00
## mus[3] 2.71 0.39 2.09 3.32 3836.65 1.00
## mus[4] 3.54 0.34 3.01 4.07 4334.39 1.00
## mus[5] 3.33 0.35 2.77 3.88 4156.52 1.00
## mus[6] 2.94 0.21 2.60 3.27 5263.47 1.00
## mus[7] 3.14 0.20 2.81 3.47 4789.54 1.00
## mus[8] 3.34 0.20 3.03 3.65 5262.47 1.00
## mus[9] 2.70 0.19 2.39 3.01 5425.89 1.00
## mus[10] 3.16 0.21 2.82 3.50 5486.87 1.00
## mus[11] 3.37 0.10 3.20 3.54 3616.22 1.00
## mus[12] 3.16 0.10 3.00 3.32 4389.56 1.00
## mus[13] 2.80 0.10 2.64 2.97 4097.14 1.00
## mus[14] 2.40 0.10 2.23 2.57 4591.31 1.00
## mus[15] 3.14 0.10 2.98 3.31 4373.12 1.00
Now lets look at the true and estimated satisfaction in departments, where the estimated satisfaction from the no-pooling model is shown as blue dots:
x = plot_data(return.x = TRUE)
est.no_pooling = precis(fit.no_pooling, pars = "mus", depth = 2)
points(x,est.no_pooling$mean, col = "blue", pch = 16)
As expected, the estimates of this simple, no-pooling, Bayesian model correspond to the averages we calculated earlier, except that we see a bit of shrinkage towards the prior mean for the small departments (1-5).
Now lets compare the no pooling and full pooling model:
compare(fit.full_pooling.b,
fit.no_pooling) %>%
round(2)
## WAIC SE dWAIC dSE pWAIC weight
## fit.no_pooling 1928.68 37.62 0.0 NA 13.42 1
## fit.full_pooling.b 1969.38 36.23 40.7 16.45 1.98 0
As expected, the no pooling model is estimate to have a higher number of effective parameters (13 vs 2) than the full pooling model. The difference of 11 is not the same as the number of departments (15), but close enough.
Hierarchical regression and multilevel regression are different terms for the same concept. Both model the variation between groups of observations with random effects. Most Hierarchical regression models can also be called mixed effects models, because the include random effects and fixed effects, where the latter are assumed to be constant across groups.
In the full and no pooling analysis, we determined the the degree of pooling by setting the standard deviation for the department level satisfaction to a very low or very high number, respectively.
The key advantage of multilevel regression and partial pooling models is that, instead of setting this standard deviation to a fixed value, they estimate it as a parameter from the data. We are learning the degree of pooling from the data. This allows such models to adapt to situations where the different groups are either similar (low standard deviation) or dissimilar (high standard deviation).
Hierachical models require assumptions that can be avoided in the no-pooling case.
Here is an overview over the three model types.
| Full pooling | No pooling | Partial pooling |
|---|---|---|
|
\[ S_i \sim normal(\mu_d, \sigma) \\ \mu_d \sim normal(\bar \mu,\color{red}{.0001}) \\ \bar \mu \sim normal(3,3) \\ \sigma \sim exponential(1)\] |
\[S_i \sim normal(\mu_d, \sigma) \\ \mu_d \sim normal(\bar \mu,\color{red}{1000}) \\ \bar \mu \sim normal(3,3) \\ \sigma \sim exponential(1)\] |
\[S_i \sim normal(\mu_d, \sigma)\\ {\mu_d} \sim normal(\bar \mu,\color{red}{\sigma_{\mu}}) \\ \sigma \sim exponential(1) \\ \bar \mu \sim normal(3,3) \\ \color{red}{\sigma_{\mu} \sim exponential(0,.5)}\] |
\(\small \bar \mu\) and \(\small \sigma_{\mu}\) describe latent properties of groups of observations and are called hyperparameters. The priors for hyperparameters are often called hyperpriors.
Here is the partial pooling ulam model:
m.partial_pooling = alist(
S ~ dnorm(mu, sigma),
mu <- mu_bar + z[d] * sigma_mu,
mu_bar ~ dnorm(3,3),
sigma ~ dexp(1),
z[d] ~ dnorm(0,1),
sigma_mu ~ dhalfnorm(0,2),
gq> vector[d]:mus<<-mu_bar+z*sigma_mu
)
Now we estimate the model …
fit.partial_pooling = ulam(
m.partial_pooling,
data = u.data,
log_lik = TRUE, iter = 1000,
chains = 4
)
precis(fit.partial_pooling,depth = 2) %>%
round(2)
## mean sd 5.5% 94.5% n_eff Rhat4
## mu_bar 3.04 0.09 2.89 3.20 623.24 1.00
## sigma 1.02 0.03 0.98 1.06 2249.05 1.00
## z[1] 0.45 0.78 -0.79 1.71 2040.26 1.00
## z[2] -0.53 0.76 -1.72 0.67 2311.23 1.00
## z[3] -0.39 0.81 -1.68 0.92 2419.21 1.00
## z[4] 0.71 0.73 -0.44 1.86 2026.16 1.00
## z[5] 0.43 0.75 -0.75 1.65 2557.04 1.00
## z[6] -0.20 0.62 -1.21 0.76 1418.61 1.00
## z[7] 0.22 0.59 -0.69 1.20 1661.63 1.00
## z[8] 0.67 0.61 -0.30 1.70 1272.08 1.00
## z[9] -0.73 0.62 -1.73 0.24 1464.15 1.00
## z[10] 0.26 0.60 -0.70 1.20 1716.11 1.00
## z[11] 0.98 0.48 0.28 1.77 986.03 1.00
## z[12] 0.36 0.44 -0.31 1.06 1061.16 1.00
## z[13] -0.70 0.43 -1.38 -0.03 938.84 1.00
## z[14] -1.92 0.57 -2.88 -1.06 778.76 1.00
## z[15] 0.30 0.43 -0.37 1.00 1095.08 1.00
## sigma_mu 0.31 0.09 0.20 0.47 616.42 1.01
## mus[1] 3.19 0.25 2.81 3.60 2239.54 1.00
## mus[2] 2.88 0.23 2.50 3.24 2537.76 1.00
## mus[3] 2.92 0.25 2.51 3.31 2327.49 1.00
## mus[4] 3.26 0.23 2.91 3.66 1749.68 1.00
## mus[5] 3.17 0.23 2.81 3.57 2768.52 1.00
## mus[6] 2.98 0.18 2.70 3.27 2724.51 1.00
## mus[7] 3.11 0.16 2.85 3.37 3126.59 1.00
## mus[8] 3.24 0.17 2.98 3.52 3420.75 1.00
## mus[9] 2.82 0.17 2.53 3.09 2468.87 1.00
## mus[10] 3.12 0.17 2.84 3.39 2723.14 1.00
## mus[11] 3.33 0.10 3.18 3.49 2410.32 1.00
## mus[12] 3.15 0.10 2.99 3.29 2975.65 1.00
## mus[13] 2.83 0.09 2.68 2.98 3109.48 1.00
## mus[14] 2.47 0.10 2.31 2.63 2101.18 1.00
## mus[15] 3.13 0.10 2.97 3.28 3610.49 1.00
Let’s see if the model detected some variability in department level satisfaction:
post.sigma_mu = extract.samples(fit.partial_pooling)$sigma_mu
hist(post.sigma_mu, xlim = c(0, max(post.sigma_mu)), breaks = 30)
abline(v = sd_dep_satisfaction, lty = 2, lwd = 2, col = "green3")
The vertical green line is the true standard deviation, which shows that the model accurately estimates the variability of departments.
Models in which we estimate the between group standard deviation are called partial pooling models, because they lie somewhere on the continuum between no and full pooling.
Now lets look at the estimated means:
x = plot_data(return.x = TRUE)
est.partial_pooling = precis(fit.partial_pooling, pars = "mus", depth = 2)
points(x,est.partial_pooling$mean, col = "blue", pch = 16)
The means estimated with the multilevel / partial pooling model lie between the means of the full pooling approach (dotted red line) and the means of the no pooling approach (bar height).
We can compare the models to see which model generalizes best and to compare the number of effective parameters.
compare(fit.full_pooling.b,
fit.partial_pooling,
fit.no_pooling) %>%
round(2)
## WAIC SE dWAIC dSE pWAIC weight
## fit.partial_pooling 1923.24 37.43 0.00 NA 9.99 0.94
## fit.no_pooling 1928.68 37.62 5.43 3.05 13.42 0.06
## fit.full_pooling.b 1969.38 36.23 46.13 14.15 1.98 0.00
Model flexibility, which can lead to overfitting, is not simply a function of the number of parameters a model has. It also depends on how much parameters allow a model to fit the data. Regularization reduces a models ability to fit the data. In hierarchical models, regularization is learned from the data.
The number of effective parameters for the partial pooling model is smaller than for the no pooling model, even though it really has an additional free parameters! shrinkage reduces the number of effective parameters because it makes the group level effects more dependent than they would be in a no pooling model. We also see that the approximate out-of sample predictive accuracy is best for the partial pooling model. But what about accuracy for the current sample?
To see which method provides the best estimates of department level satisfaction, we compare each with the true department level satisfactions. Remember that the department averages calculated from individuals responses have an error due to the measurement error in the individual level data we collected:
abs.delta = rbind(
no_pooling = est.no_pooling$mean - department_mus,
full_pooling = est.full_pooling$mean - department_mus,
partial_pooling = est.partial_pooling$mean - department_mus
)
Here is a plot of these deviations:
abs.delta = abs.delta %>% abs() %>% t()
matplot(abs.delta, pch = 16, col = 2:4, xaxt = "none")
axis(1,at = 1:15, cex.axis = .85)
legend("topleft", bty = "n",
title = "Pooling",
col = 2:4,
pch = 16,
legend = gsub("_pooling","",colnames(abs.delta)))
It seems as if the partial pooling model is doing best, but it is not 100% clear.
A standard metric to calculate the performance of different models is the root means square deviation (RMSD), which we calculate now:
abs.delta^2 %>% # square deviation
colMeans() %>% # mean
sqrt() # root
## no_pooling full_pooling partial_pooling
## 0.2411478 0.2469335 0.1635485
Indeed, the partial pooling model has the smallest error. This is due to the same property of shrinkage estimators we already observed earlier in the course: Shrinkage estimators fit the data not as well as other models (here the no pooling model) but they are better at out of sample prediction.
Note though, that this is not a big surprise, because the data was generated according to the multilevel / partial pooling model.
Here is a function that produces the same plot we just saw, and that takes the between department standard deviation as an input. This allows us to test if the partial pooling model does well, even if a full or no pooling model is the true DGP.
check.accuracy = function(sd_dep_satisfaction = .25, n_departments = 15) {
fn = paste0("accuracy_",100*sd_dep_satisfaction,"_",n_departments,".Rdata")
n_employees =
rep(c(5,22,98), each = n_departments/3) +
rbinom(n_departments,n_departments/3,.5)
if (file.exists(fn)) {
load(fn)
} else {
# simulate data
department_mus =
rnorm(n_departments,
mean_firm_satisfaction,
sd_dep_satisfaction)
firm_satisfaction = c()
for (d in 1:n_departments) {
employee_satisfaction =
rnorm(n_employees[d],
department_mus[d],
1)
firm_satisfaction = rbind(
firm_satisfaction,
data.frame(
department = as.integer(d),
satisfaction = employee_satisfaction
)
)
}
#prepare ulam.data
u.data = list(
N = nrow(firm_satisfaction),
S = firm_satisfaction$satisfaction,
d = as.integer(firm_satisfaction$department))
# fit.models
fit.full_pooling = ulam(m.full_pooling.b,data = u.data,chains = 4, cores = n.cores)
fit.no_pooling = ulam(m.no_pooling,data = u.data,chains = 4, cores = n.cores)
fit.partial_pooling = ulam(m.partial_pooling,data = u.data,chains = 4, cores = n.cores)
save(fit.full_pooling,
fit.no_pooling,
fit.partial_pooling,
department_mus,u.data,
firm_satisfaction,
file = fn)
}
# get estimates
est.full_pooling = precis(fit.full_pooling, pars = "mus", depth = 2)
est.no_pooling = precis(fit.no_pooling, pars = "mus", depth = 2)
est.partial_pooling = precis(fit.partial_pooling, pars = "mus", depth = 2)
abs.delta = rbind(
no_pooling = est.no_pooling$mean - department_mus,
full_pooling = est.full_pooling$mean - department_mus,
partial_pooling = est.partial_pooling$mean - department_mus
) %>% abs() %>% t()
layout(matrix(c(1,1,2,3),ncol = 2))
RMSD = abs.delta^2 %>% colMeans() %>% sqrt()
matplot(abs.delta, pch = 16, col = 2:4, ylab = "RMSD", xaxt = "none",
main = paste0("sd = ", sd_dep_satisfaction, ", ",n_departments," groups"))
axis(1,at = 1:15, cex.axis = .7)
legend("topleft", bty = "n",
title = "Pooling",
col = 2:4,
pch = 16,
legend = gsub("_pooling","",colnames(abs.delta)))
x = barplot(RMSD, col = 2:4, ylab = "RMSD")
text(x,RMSD,round(RMSD,2),pos = 3, xpd = TRUE)
post.sigma_mu = extract.samples(fit.partial_pooling)$sigma_mu
hist(post.sigma_mu, xlim = c(0, max(post.sigma_mu)), probability = T)
curve(dnorm(x,0,2), add = T, col = "blue")
abline(v = 0, lty = 2, col = "red")
abline(v = sd_dep_satisfaction, lty = 2, lwd = 2, col = "green3")
}
Models’ performance if there is no variation between groups/departments.
set_par(cex = 1, mar=c(3,3,3,.5))
check.accuracy(sd_dep_satisfaction = .01, n_departments = 15)
Models’ performance if there is some variation between groups/departments.
set_par(cex = 1, mar=c(3,3,3,.5))
check.accuracy(sd_dep_satisfaction = 0.5, n_departments = 15)
Models’ performance if there is a lot of variation between groups/departments.
set_par(cex = 1, mar=c(3,3,3,.5))
check.accuracy(sd_dep_satisfaction = 10, n_departments = 15)
The last plot shows that if the number of groups is only moderately large and the prior for the standard deviation is relatively narrow, one can obtained biased estimates of the between group standard deviation. It is therefore especially with smaller numbers of groups important to use domain knowledge to determine reasonable priors for the standard deviation.
The partial pooling model is generally nearly as accurate as the “true” model. The model learned the right amount of pooling from the data. Obviously, this also depends on how much data is available! The last figures are based on 15 groups, but if the number of groups becomes small and each group has only few members, a partial pooling model will have difficulties estimating the standard deviation for the between group variation.
Still, it is generally a good idea to use hierarchical models when possible, if one is interested in estimating parameters for several units, because the it can be shown that pooling towards a common baseline will generally reduce the error. For more, see the James-Stein Estimator or the Stein’s paradox or example.
One issue with multilevel models (with partial pooling) is that they are not “unbiased”, which means that they do not provide an error free estimate of the sample mean.
Lets look at the means from the no pooling an partial pooling models, together with the observed sample means:
sample_means =
describeBy(firm_satisfaction$satisfaction,
group = firm_satisfaction$department,
mat = TRUE)[,"mean"]
set_par()
plot(1:15, sample_means, ylab = "mean", xlab = "department", xaxt = "none")
axis(1,at = 1:15, cex.axis = .85)
points(1:15, est.no_pooling$mean, pch = 16, col = 2)
points(1:15, est.partial_pooling$mean, pch = 16, col = 4)
abline(h = mean(u.data$S), lty = 3)
arrows(x0 = 1:15,
y0 = est.no_pooling$mean,
y1 = est.partial_pooling$mean,
col = 4, lty = 3, length = .1)
legend("topright",
pch = c(1,16,16),
col = c(1,2,4), bty = "n",
legend = c("obs. dep. mean","no pooling", "partial pooling"))
The deviation between sample mean and partial pooling estimate
shows that the latter is a biased estimate of the sample mean. This bias
is larger for smaller departments (on the left hand side) which which
are shrunk more, and it is also larger for departments with extreme
means, compare e.g. departments 13 and 14.
Now, one could be concerned that partial pooling makes it difficult to detect differences between groups. But the bias just described is only part of the story. Next, we plot the estimates together with their credible intervals:
set_par()
ylim = range(c(est.no_pooling[,3:4],est.partial_pooling[, 3:4]))
plot(1:15, sample_means, ylab = "mean", xlab = "department", ylim = ylim, xaxt = "none")
axis(1,at = 1:15, cex.axis = .85)
points((1:15)-.1, est.no_pooling$mean, pch = 16, col = 2)
points((1:15)+.1, est.partial_pooling$mean, pch = 16, col = 4)
abline(h = mean(u.data$S), lty = 3)
arrows(x0 = (1:15)-.15,
y0 = est.no_pooling$`5.5%`,
y1 = est.no_pooling$`94.5%`,
col = 2, length = 0)
arrows(x0 = (1:15)+.15,
y0 = est.partial_pooling$`5.5%`,
y1 = est.partial_pooling$`94.5%`,
col = 4, length = 0)
legend("topright",
pch = c(1,16,16),
col = c(1,2,4), bty = "n",
legend = c("obs. dep. mean","no pooling", "partial pooling"))
What do you see?
The variance of the estimates from the hierarchical partial pooling model is generally smaller, especially for smaller departments. The intuition is that because the estimate for one department also relies partially on data from other departments, the estimated mean for the department was derived from more data and is therefore less uncertain.
This reduction in variance means that even though all estimates were shrunk towards the global mean and group differences became smaller, we can still detect group differences well due to the reduces variance of the estimates for the group mean.
More generally, it can be said that total error of an estimate can be decomposed into:
VB = function(truth, estimate) {
return(c(
variance = mean((estimate-mean(estimate))^2),
bias = mean((truth-mean(estimate))^2)
))}
Shrinkage methods and in particular multilevel regression and partial pooling make a different trade off than the more traditional no-pooling estimate by exchanging some bias for a reduction in variance.
The following figure shows how partial pooling trades off variance for bias:
VB.partial_pooling =
do.call(rbind,lapply(1:15, function(d) {
VB(sample_means[d],extract_post_ulam(fit.partial_pooling,pars = "mus")[[1]][,d])
}))
VB.no_pooling =
do.call(rbind,lapply(1:15, function(d) {
VB(sample_means[d],extract_post_ulam(fit.no_pooling,pars = "mus")[[1]][,d])
}))
set_par()
xlim = range(c(VB.partial_pooling[,1],VB.no_pooling[,1]))
ylim = range(c(VB.partial_pooling[,2],VB.no_pooling[,2]))
plot(0, type = "n", col = 4,
ylim = ylim, xlim = xlim,
ylab = "Bias", xlab = "Variance")
arrows(x0 = VB.no_pooling[,1],
y0 = VB.no_pooling[,2],
x1 = VB.partial_pooling[,1],
y1 = VB.partial_pooling[,2],
length = .1, col = "grey")
text(VB.partial_pooling[,1], VB.partial_pooling[,2], 1:15, col = 4)
text(VB.no_pooling[,1], VB.no_pooling[,2], 1:15, col = 2)
legend("topright", bty = "n",
col = c(2,4), pch = 16,
legend = c("no pooling","partial pooling"))
Hence, even if a hierarchical model means that sub-group estimates are pulled toward the grand mean, we can still detect group difference well because the group means are estimated more precisely.
Rewrite the following model as a multilevel model
| No pooling | Partial pooling |
|---|---|
| \[y_i \sim Binomial(1,p_i) \\ logit(p_i) \sim \alpha_{GROUP[i]} + \beta x_i \\ \color{red}{\alpha_{GROUP} \sim Normal(0,1.5)} \\ \beta \sim Normal(0,0.5)\] | \[ y_i \sim Binomial(1,p_i) \\ logit(p_i) \sim \alpha_{GROUP[i]} + \beta x_i \\ \color{red}{\alpha_{GROUP} \sim Normal(\bar \alpha, \sigma_\alpha)} \\ \beta \sim Normal(0,0.5) \\ \color{red}{\bar \alpha \sim Normal(0,1.5)} \\ \color{red}{\sigma_\alpha \sim Exponential(0,1)} \] |
Rewrite the following model as a multilevel model
| No pooling | Partial pooling |
|---|---|
| \[y_i \sim Normal(\mu_i,\sigma) \\ \mu_i \sim \alpha_{GROUP[i]} + \beta x_i \\ \color{red}{\alpha_{GROUP} \sim Normal(0,5)} \\ \beta \sim Normal(0,1) \\ \sigma \sim Exponential(1) \] | \[y_i \sim Normal(\mu_i,\sigma) \\ \mu_i \sim \alpha_{GROUP[i]} + \beta x_i \\ \color{red}{\alpha_{GROUP} \sim Normal(\bar \alpha, \sigma_\alpha)} \\ \beta \sim Normal(0,1) \\ \sigma \sim Exponential(1) \\ \color{red}{\bar \alpha \sim Normal(0,5)} \\ \color{red}{\sigma_\alpha \sim Exponential(1)} \] |
The Normal and the Cauchy distributions:
set_par(mfrow = c(2,1))
curve(dnorm(x),-7,7,ylab = "density")
curve(dcauchy(x), col = "blue", add = T, n = 500)
legend("topleft",lty = 1,bty = "n",
col = c("black","blue"),
legend = c("Normal","Cauchy"))
curve(dcauchy(x)/dnorm(x), -7,7, col = "red", n = 500, log = "y")
The Student-t and the Cauchy distributions:
set_par()
curve(dnorm(x),-7,7,ylab = "density", col = "grey")
curve(dstudent(x,nu = 2), add = T, col = "green3", n = 500)
#curve(dstudent(x,nu = 10), add = T, col = "green3",n = 500, lty = 3)
curve(dcauchy(x), col = "blue", add = T, n = 500)
legend("topleft",lty = c(1,1,1),bty = "n",
col = c("green3","blue","gray"),
legend = c("Student-t, df = 2","Cauchy", "Normal"))
Non centered parameterization to solve divergent iterations
| centered | non-centered |
|---|---|
alist( S ~ dbinom(N,p), logit(p) <- a[tank], a[tank] ~ dstudent(2,a_bar,sigma), a_bar ~ dnorm(0,1.5), sigma ~ dexp(1) ) |
alist( S ~ dbinom(N,p), logit(p) <- a[tank], a <- a_bar + z*sigma, a_bar ~ dnorm(0,1.5), sigma ~ dexp(1), z[tank] ~ dstudent(2,0,1) ) |
As we have seen earlier, extreme values are more probable under the Cauchy distribution.
If we then have the data \(y = 0\) and use the model
\[ y \sim Normal(\mu,1) \\ \mu \sim Normal(10,1) \]
If we use the model
\[ y \sim Normal(\mu,1) \\ \mu \sim Student(2,10,1) \]
If we use the model
\[ y \sim Student(2,\mu,1) \\ \mu \sim Normal(10,1) \]
If we use the model
\[ y \sim Student(2,\mu,1) \\ \mu \sim Student(2,10,1) \]